
Algorithms for Molecular Biology

Springer Science and Business Media LLC

Preprints posted in the last 30 days, ranked by how well they match Algorithms for Molecular Biology's content profile, based on 15 papers previously published here. The average preprint has a 0.00% match score for this journal, so anything above that is already an above-average fit.

1
Insertions, deletions, and exchangeable couplings: a Dirichlet process over TKF92 domains and sites

Large, A. L.; Holmes, I. H.

2026-05-19 bioinformatics 10.64898/2026.05.16.725674 medRxiv
Top 0.2%
0.3%

The TKF92 model of molecular evolution--a linear birth-death process for indels, with finite-state continuous-time Markov chain substitutions--is exchangeable in residue identity at every site: the generative process treats amino acids symmetrically, conditional on a single substitution rate matrix. To introduce local heterogeneity, evolutionary models are often equipped with site-class mixtures, preserving this symmetry in the sense of de Finetti: conditional on the latent class, residues are still exchangeable. In a four-step theoretical ladder, we show how long-range structure such as couplings between distant sites can also be introduced exchangeably by using a Dirichlet process to partition sites into co-evolving classes. Our first step is a thorough analysis of TKF92 to establish sufficient statistics, limiting behavior, and inferential tools. We then lift the pairwise TKF92 hidden Markov model, in the limit of small time, to a time-indexed gravestone-augmented pair stochastic context-free grammar, and from there to its phylogenetic generalisation. This framing allows trajectories to be sampled exactly by Inside-Outside recursion. The third step places a Dirichlet process over the alive sites and asks co-keyed sites to evolve under a sparse Potts interaction -- an exchangeably-partitioned hidden direct-coupling model whose marginal alignment likelihood is unchanged from plain TKF92. The fourth rung of the ladder develops inference machinery: a Gibbs-Metropolis sampler that alternates alignment resamples, key-partition resamples, and stochastic parameter updates. 
We close several gaps along the way -- exact closed-form sufficient statistics for the linear birth-death-immigration component, the resolvable L'Hôpital limit at λ =, and a closed-form M-step for a recursive generalisation of TKF92 -- and we report a 1,000-family Pfam fit with K=4 site classes whose Potts atoms carry ~0.54 nats of covariation per class-pair on top of a substantial single-site substitution model. Supplementary material, including full source code for inference, may be found at https://tkfdp.net/.

2
PARiS: Probabilistic Assignment and Repartitioning of isomiR Sequences: A data-driven method for denoising isomiR read count data

Swan, H. K.; Baran, A. M.; Aparicio-Puerta, E.; Halushka, M. K.; Jun, S.-H.; McCall, M. N.

2026-05-12 bioinformatics 10.64898/2026.05.09.723882 medRxiv
Top 0.3%
0.1%

MicroRNAs (miRNAs) are non-coding RNAs, approximately 18-24 nucleotides in length, with important gene regulatory functions. In small RNA sequencing (sRNA-seq), observed isoforms of miRNA, called isomiRs, arise from many biological and technical processes. Alterations in isomiR expression have been linked to a wide variety of human diseases, from cancers to neurological diseases. However, it is difficult to distinguish between technical and biological isomiRs. We present PARiS, an algorithm for the Probabilistic Assignment and Repartitioning of isomiR Sequences, that identifies technical error isomiRs in sRNA-seq data and reassigns them to their most likely biological source. We assess the ability of PARiS to identify and remove error isomiR sequences in a realistic simulation study. Additionally, we compare PARiS to alternative approaches, focusing on downstream miRNA-level differential expression analysis in a variety of settings, including a set of simulated datasets, an experimental benchmark dataset, and three colorectal adenocarcinoma cell lines.

3
Steering Sequence Generation in Protein Language Models through Iterative Lookback Monte Carlo Sampling

Calvanese, F.; Lombardi, G.; Weigt, M.; FERNANDEZ-DE-COSSIO-DIAZ, J.

2026-05-07 bioinformatics 10.64898/2026.05.01.722156 medRxiv
Top 0.3%
0.1%

Protein language models (pLMs) leverage large-scale evolutionary data to generate novel sequences, but steering generation toward desired physicochemical properties without sacrificing diversity remains a major challenge. Existing approaches often induce severe diversity loss or require computationally expensive retraining. We introduce Iterative Lookback Monte Carlo (ILMC), a training-free inference-time sampling strategy that interleaves autoregressive elongation with Metropolis-Hastings refinement to approximate sampling from a maximum-entropy target distribution balancing generative quality and steering objectives. We show theoretically that this target distribution is entropy-maximizing under fixed generative quality and steering constraints, and empirically that ILMC produces more diverse samples than standard autoregressive baselines at matched generative quality. Using simple steering potentials, ILMC improves desired molecular properties, including generating proteins with up to 12 °C higher predicted melting temperature than compute-matched alternative strategies. ILMC naturally applies to classifier-guided steering, where it outperforms purely autoregressive guidance in diversity while maintaining comparable enrichment of target properties. We validate ILMC on family-specific pLMs and on the multi-family model ProGen3.

4
HiCPEP: Efficient estimation of chromatin compartment PC1 from Hi-C covariance structure

Cheng, Z.-R.; Chang, J.-M.

2026-05-18 bioinformatics 10.64898/2026.05.14.725269 medRxiv
Top 0.3%
0.1%

Principal component analysis (PCA) of the Hi-C Pearson correlation matrix is the standard approach for identifying A/B chromatin compartments. Despite its widespread use, the relationship between the first principal component (PC1) and the underlying compartment structure remains insufficiently characterized, and computing PC1 can become computationally expensive for high-resolution Hi-C data. Here we investigate the role of the PC1 explained variance ratio in compartment analysis and show that chromosomes with strong compartment organization typically exhibit a dominant PC1 signal. Based on this observation, we propose HiCPEP, a heuristic algorithm that estimates the sign pattern and relative magnitude of PC1 directly from the Hi-C Pearson covariance matrix without performing explicit eigenvector decomposition. The method can operate from either a dense Pearson matrix for fast approximation or a sparse observed/expected (O/E) matrix to reduce memory usage. Furthermore, because many covariance columns exhibit PC1-like patterns when the compartment signal is strong, HiCPEP can be accelerated using random sampling without substantially reducing accuracy. Across multiple Hi-C datasets, HiCPEP consistently recovered compartment patterns with high similarity to reference PC1 vectors produced by standard PCA-based methods. Benchmark experiments show that HiCPEP achieves comparable accuracy while reducing computational cost in terms of runtime or memory usage. These results suggest that HiCPEP provides a practical alternative for efficient chromatin compartment analysis from large-scale Hi-C datasets. The HiCPEP implementation is freely available at https://github.com/ZhiRongDev/HiCPEP.
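
The remark that covariance columns look PC1-like when one component dominates can be illustrated with a small NumPy sketch. This is not the HiCPEP implementation: the synthetic data, the rule of picking the largest-norm column, and the function name `approx_pc1_from_column` are assumptions made purely for illustration.

```python
import numpy as np

def approx_pc1_from_column(cov):
    """Return the covariance column with the largest norm, scaled to unit length.

    If cov is close to lam * v v^T, every column is nearly proportional to v,
    so a single column already carries the sign pattern of PC1.
    """
    norms = np.linalg.norm(cov, axis=0)
    col = cov[:, np.argmax(norms)]
    return col / np.linalg.norm(col)

rng = np.random.default_rng(0)
n_bins = 200
# Synthetic "compartment" signal: alternating blocks of +1 / -1 (assumption).
signal = np.where((np.arange(n_bins) // 20) % 2 == 0, 1.0, -1.0)
# Rows are samples: a strong rank-1 signal plus weak independent noise.
data = np.outer(rng.normal(size=500), signal) + 0.3 * rng.normal(size=(500, n_bins))
cov = np.cov(data, rowvar=False)

# Reference PC1 from a full eigendecomposition (eigh sorts eigenvalues ascending).
eigvals, eigvecs = np.linalg.eigh(cov)
pc1 = eigvecs[:, -1]

approx = approx_pc1_from_column(cov)
similarity = abs(float(approx @ pc1))          # eigenvector sign is arbitrary
explained = float(eigvals[-1] / eigvals.sum())
print(f"PC1 explained variance ratio: {explained:.3f}")
print(f"|cosine similarity| of column estimate vs PC1: {similarity:.3f}")
```

When the explained variance ratio of PC1 is low, a single column is no longer a reliable stand-in, which is consistent with the abstract's emphasis on that ratio.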

5
Efficient Stochastic Trace Generation for Transcription

Ferdowsi, A.; Fuegger, M.; Nowak, T.

2026-05-08 bioinformatics 10.64898/2026.05.05.722871 medRxiv
Top 0.3%
0.1%

Bursty transcription in single cells typically produces over-dispersed, skewed, and sometimes heavy-tailed expression distributions that are explained by two-state Markov models of the promoters. While the gold standard for simulation is exact stochastic sampling with Gillespie's algorithm, obtaining thousands of timed traces is computationally costly. Surrogate models based on stochastic differential equations (SDEs) are widely used to speed up this simulation process. An example is the Chemical Langevin Equation based on Gaussian noise, which, however, does not capture heavy-tailed noise. In this work, we present a unified SDE framework that combines deterministic drift, Gaussian fluctuations, and additive sporadic jumps of arbitrary distributions, and provide an open-source Python implementation, bcrnnoise. The framework subsumes standard surrogate models and allows for vectorized generation of batches of transcription traces. We assess computational speed and accuracy of common surrogate models along with new models, showing that high accuracy can be obtained while reducing computational cost by up to two orders of magnitude.
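
For readers unfamiliar with the exact baseline mentioned above, the following is a minimal sketch of Gillespie's algorithm applied to a two-state promoter. It does not use or reproduce bcrnnoise; the function name, the rate names (`k_on`, `k_off`, `k_tx`, `k_deg`), and the numerical values are assumptions chosen for illustration.

```python
import random

def gillespie_telegraph(k_on, k_off, k_tx, k_deg, t_end, seed=0):
    """Exact stochastic simulation of a two-state promoter.

    Reactions: OFF -> ON (k_on), ON -> OFF (k_off),
               ON -> ON + mRNA (k_tx), mRNA -> nothing (k_deg per molecule).
    Returns the event times and the mRNA count after each event.
    """
    rng = random.Random(seed)
    t, on, m = 0.0, 0, 0
    times, counts = [t], [m]
    while True:
        # Propensities of the four reactions in the current state.
        a = [k_on * (1 - on), k_off * on, k_tx * on, k_deg * m]
        a_total = sum(a)
        if a_total == 0.0:
            break
        # Waiting time to the next event is exponential with rate a_total.
        t += rng.expovariate(a_total)
        if t > t_end:
            break
        # Pick the reaction with probability proportional to its propensity.
        r = rng.random() * a_total
        if r < a[0]:
            on = 1
        elif r < a[0] + a[1]:
            on = 0
        elif r < a[0] + a[1] + a[2]:
            m += 1
        else:
            m = max(m - 1, 0)  # guard against floating-point edge cases
        times.append(t)
        counts.append(m)
    return times, counts

times, counts = gillespie_telegraph(k_on=0.5, k_off=0.5, k_tx=10.0, k_deg=1.0, t_end=50.0)
print(f"{len(times) - 1} events, final mRNA count {counts[-1]}")
```

Each trace requires one random waiting time and one reaction choice per event, which is why generating thousands of long traces this way becomes expensive.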

6
DistPCA: Tera-Scale Genomic PCA via Out-of-Core Distributed Parallelism

Mermigkis, G.; Sofotasios, A.; Kontopoulou, E.-M.; Gallopoulos, E.; Hadjidoukas, P.

2026-05-19 bioinformatics 10.64898/2026.05.15.725487 medRxiv
Top 0.3%
0.1%

Principal Component Analysis (PCA) is a fundamental tool in human genetics, widely used to study population structure. However, the rapid growth of modern genomic datasets, which often exceed main memory capacity, renders traditional PCA methods infeasible, motivating out-of-core approaches. Prior work on out-of-core genomic PCA has focused primarily on optimizing the inherently compute-intensive numerical core, largely overlooking the stages of data I/O and preprocessing, which emerge as significant performance bottlenecks at tera-scale. Furthermore, existing approaches remain limited to shared-memory single-node architectures, lacking support for distributed multi-node environments. To address these limitations, we introduce DistPCA, the first distributed out-of-core framework for tera-scale genomic PCA, implemented as a C++ library and scalable across both single- and multi-node systems. Built on top of the Message Passing Interface (MPI), the proposed framework employs multi-level data parallelism across the entire PCA pipeline, combining multiprocessing, multithreading, SIMD vectorization, and compute-transfer overlap, while remaining compatible with block-based methods that rely on associative operations. Extensive evaluation on real and synthetic datasets demonstrates near-linear scalability, achieving speedups of up to 58.2x and over 98% reduction in wall-clock time, while maintaining parallel efficiency above 82% and preserving accuracy in the recovered principal components.
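
To make the phrase "block-based methods that rely on associative operations" concrete, here is a minimal single-process sketch in Python. It is not DistPCA's C++ API or algorithm: the function name `blockwise_pca` and the choice of accumulating a sample-by-sample Gram matrix are assumptions used only to show why a blockwise sum can be computed out of core or across workers.

```python
import numpy as np

def blockwise_pca(blocks, n_components):
    """PCA over SNP blocks that are visited one at a time.

    Each block is an (n_samples, n_snps_in_block) array. The sample-by-sample
    Gram matrix is a sum over blocks, so blocks can be processed in any order
    (or on different workers) and combined with plain addition.
    """
    gram = None
    for block in blocks:
        # Center every SNP column; each column lives entirely inside one block.
        centered = block - block.mean(axis=0, keepdims=True)
        contrib = centered @ centered.T
        gram = contrib if gram is None else gram + contrib
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvals[order], eigvecs[:, order]

rng = np.random.default_rng(0)
genotypes = rng.integers(0, 3, size=(50, 1000)).astype(float)  # toy 0/1/2 genotypes
blocks = np.array_split(genotypes, 10, axis=1)                 # ten SNP blocks
top_vals, top_vecs = blockwise_pca(blocks, n_components=2)
print(top_vals.shape, top_vecs.shape)
```

Because addition is associative, partial Gram matrices from different nodes could be combined with a reduction; only one block needs to be resident in memory at a time.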

7
Agentic systems are adept at solving well-scoped, verifiable problems in computational biology

Nair, S.; Gunsalus, L.; Orcutt-Jahns, B.; Rossen, J.; Lal, A.; Donno, C. D.; Celik, M. H.; Fletez-Brant, K.; Xie, X.; Bravo, H. C.; Eraslan, G.

2026-05-03 bioinformatics 10.64898/2026.04.06.716850 medRxiv
Top 0.3%
0.1%

We introduce CompBioBench, a benchmark of 100 diverse tasks for evaluating agentic systems in computational biology. Unlike mathematics and programming, which more readily admit systematic verification, biological data are inherently noisy and open to interpretation. To enable objective evaluation without reducing tasks to prescriptive checklists, we propose a new benchmark construction strategy based on synthetic/augmented data and metadata scrambling/scrubbing of real datasets to create challenging problems that have a single ground-truth answer and require multi-step reasoning, tool use, bespoke code, and interaction with real-world external resources. The benchmark spans genomics, transcriptomics, epigenomics, single-cell analysis, human genetics, and machine learning workflows. Questions are curated by domain experts to cover a broad range of skills with varying difficulty. We evaluate leading general-purpose agentic systems starting from a bare-minimum environment, requiring them to fetch data and tools as needed to solve each problem. We find strong end-to-end performance, with Codex CLI (GPT 5.4) reaching 83% accuracy, Gemini CLI (3.1 Pro) reaching 82%, Claude Code (Opus 4.6) reaching 81%, and Claude Code (Opus 4.7) reaching 78%. On the hardest questions, Claude Code (Opus 4.6) reaches 69%, Codex CLI (GPT 5.4) reaches 59%, and Gemini CLI (3.1 Pro) reaches 49%. CompBioBench provides a practical testbed for measuring the progress of agentic systems in computational biology and for guiding future benchmark design. Data and a public leaderboard are available at https://huggingface.co/collections/Genentech/compbiobench-v1.

8
MeiCOfi: Meiotic CrossOver Finder in haploid, diploid, polyploid and hyper-recombinant genomes

Fuentes, R. R.; Fernandes, J. B.; Susanto, T.; Wang, Y.; Underwood, C. J.

2026-05-04 bioinformatics 10.64898/2026.04.29.721680 medRxiv
Top 0.3%
0.1%

During meiotic cell division, homologous chromosomes pair and recombine, leading to large reciprocal exchanges of genetic information. In most species, meiotic crossovers (COs) are crucial for normal chromosome segregation and they generate genetic diversity, which can be acted upon by natural selection in wild populations or by breeders to combine desirable traits in a genome. Identifying the position and frequency of COs is therefore essential in both classical genetics studies and breeding programmes. However, a computational tool capable of accurately detecting COs across diverse contexts, including varying marker densities, genome size and structure, recombination rate, and ploidy, remains lacking. We developed MeiCOfi (Meiotic CrossOver Finder) to detect meiotic crossover events at high resolution from low-coverage genome sequencing data. We evaluated it using data from Arabidopsis thaliana, rice, barley and both intra- and inter-specific tomato hybrids, encompassing a wide range of genome complexities and marker densities. It reliably detects crossovers in hyper-recombinant A. thaliana with up to 62 COs per backcross offspring and in haploid gametes from barley with sequencing coverage as low as 0.1x. It can identify crossovers in polyploid genomes, including simulated recombinant tetraploids and also real data from tetraploid tomato hybrid offspring. Our results demonstrate that MeiCOfi can robustly identify crossovers in diverse genomic contexts.

9
Efficient and Tidy Manipulation of Annotated Matrix Data with plyxp

Landis, J. T.; Love, M. I.

2026-05-11 bioinformatics 10.64898/2026.05.06.721669 medRxiv
Top 0.4%
0.1%

Manipulating high-dimensional omics data, such as bulk or single-cell gene expression count matrices, typically requires a bioinformatics analyst to learn domain-specific functions and syntax. These matrix-centric functions and syntax can be less intuitive than working with tidy data analytic principles, as exemplified by tools such as dplyr applied to tabular data. We propose an expressive grammar for manipulating annotated matrix data, with syntax to access, modify, and append matrix data and tabular row and column metadata, including row-wise or column-wise grouped operations. This grammar defines multiple contexts and provides pronouns for specific recall and assignment within and across these contexts. The plyxp package is an implementation of this grammar for the R/Bioconductor ecosystem, with efficient abstractions for the SummarizedExperiment class. We demonstrate plyxp's efficiency compared to alternative approaches on data manipulation tasks requiring computation across contexts.

10
mehari: high-performance, strict HGVS-first variant effect prediction

Hartmann, T. F.; Zhao, M. X.; Beule, D.; Holtgrewe, M.

2026-05-14 bioinformatics 10.64898/2026.05.12.724271 medRxiv
Top 0.4%
0.1%

Variant annotation requires the precise and consistent computation of Sequence Ontology (SO) terms and Human Genome Variation Society (HGVS) nomenclature. To ensure robust synchronization between these two key facets, we present mehari, a high-performance variant effect predictor implemented in Rust that employs a strict "HGVS-first" approach. By deterministically projecting variants to transcripts before evaluating functional consequences, mehari structurally aligns HGVS notation and SO terms. Benchmarking on ClinVar demonstrates that mehari achieves exceptional processing speeds and high concordance with established tools like Ensembl VEP, while also providing refined handling for complex biological edge cases such as selenoprotein recoding.

11
Min-frame transformation enables more sensitive viral genome alignment

Doughty, R. D.; Banerjee, A.; Kille, B.; Warnow, T.; Treangen, T. J.

2026-05-22 bioinformatics 10.64898/2026.05.20.726535 medRxiv
Top 0.4%
0.0%

Motivation: Maximal unique matches (MUMs) are a fundamental primitive in genome comparison, where they serve as high-confidence anchors for downstream multiple genome alignment. However, because MUMs rely on exact string matching, their effectiveness degrades with increased genome divergence and larger sets of genomes, inhibiting their ability to recover long homologous regions and reducing the number of base pairs covered by the multiple genome alignment. Additionally, existing approaches that improve robustness to mutation, such as spaced seeds or translated alignment methods, introduce trade-offs in specificity, scalability, or computational complexity. Methods: To address this gap, we introduce the Min-Frame Transformation (MFT), a deterministic encoding of nucleotide sequences to sequences over a transformed alphabet that preserves the coordinate structure of the original sequence. At each position, the MFT selects a k-mer from a local window according to a fixed global ordering and assigns it a character in the transformed alphabet via a predefined mapping. This process captures local sequence context and can mask the impact of mutations, increasing the likelihood that homologous regions remain detectable as exact matches. The resulting transformed sequences can be indexed using standard string data structures, such as suffix arrays and suffix trees, enabling efficient extraction of MUMs without modifying existing algorithms. Impact: The MFT is a novel computational approach for improving the robustness of MUM-based seeding for genome alignment by producing longer and more contiguous matches that span a greater fraction of the genome, leading to improved alignment coverage and SNP recall. Altogether, these improvements have the potential to benefit downstream viral genome analysis applications such as phylogenetic inference and transmission analysis. Funding: Tandy Warnow: NSF grant 2316233. Todd J. Treangen: NSF grants 2126387, 2239114; NIH grants U19-AI144297, P01-AI152999.
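
The per-position step described in the abstract (pick a k-mer from a local window under a fixed global ordering, then map it to a symbol) can be sketched as follows. This is only one reading of that description and not the authors' implementation: the hash-based ordering, the hash-modulo symbol mapping, the parameter values, and the names `min_frame_transform` and `_rank` are all assumptions.

```python
import hashlib

def _rank(kmer):
    """Fixed global ordering on k-mers (assumption: order by a stable hash)."""
    return hashlib.sha256(kmer.encode()).digest()

def min_frame_transform(seq, k=4, w=6, alphabet_size=64):
    """Map each position to a symbol derived from the minimal k-mer in its window.

    Position i looks at the k-mers starting at i, i+1, ..., i+w-1 and keeps the
    one that is smallest under _rank. Output index i corresponds to input
    coordinate i, so coordinates are preserved up to the trailing window.
    """
    n_kmers = len(seq) - k + 1
    out = []
    for i in range(max(n_kmers - w + 1, 0)):
        window = [seq[j:j + k] for j in range(i, i + w)]
        chosen = min(window, key=_rank)
        # Predefined mapping into the transformed alphabet (assumption: hash mod size).
        out.append(int.from_bytes(_rank(chosen)[:4], "big") % alphabet_size)
    return out

reference = "ACGTTGCATGCCGATAGCTAGGCTTACGATCGATCGGATC"
mutated = reference[:20] + ("A" if reference[20] != "A" else "C") + reference[21:]
ref_t = min_frame_transform(reference)
mut_t = min_frame_transform(mutated)
changed = sum(a != b for a, b in zip(ref_t, mut_t))
print(f"{changed} of {len(ref_t)} transformed positions changed after one substitution")
```

A single substitution can only affect windows that contain a k-mer overlapping the mutated base, so at most k + w - 1 output symbols can change; windows whose selected k-mer does not overlap the mutation, before or after, keep their symbol.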

12
Cell-Level Virtual Screening

Ellington, C. N.; Addagudi, S.; Wang, J.; Lengerich, B. J.; Xing, E. P.

2026-05-13 bioinformatics 10.64898/2026.05.11.724149 medRxiv
Top 0.5%
0.0%

Virtual screening methods prioritize therapeutic candidates by predicting molecular properties and interactions. However, molecular models are insufficient to predict higher-order effects that arise in real biological systems, leading to late-stage failures in drug discovery. Virtual cells have been posed as a solution to this problem by predicting gene expression responses to drugs, but they remain weakly validated as screening tools; gene expression is only an intermediate in understanding drug success or failure. Despite burgeoning progress in virtual cells, some basic questions remain. Is expression even a good representation of higher-order drug effects? How can expression and other cell-level representations be applied to prioritize therapeutic candidates? Can cell-level methods be fairly compared against traditional molecular-level screens? We address these questions in a two-pronged approach. First, we curate two benchmarks, Drug-Disease Retrieval Bench (DDR-Bench) and Drug-Target Retrieval Bench (DTR-Bench), which directly compare cell-level methods against traditional molecular methods on canonical drug discovery tasks. DDR-Bench evaluates a method's ability to prioritize disease indications for drugs with novel target profiles. DTR-Bench evaluates a method's ability to reconstruct drug-target interactions from separate perturbation modalities that act on shared mechanisms, bridging the gap between cell-level methods and classic molecular screens. We identify shortcomings of existing screening methods on these benchmarks, and propose an alternative representation of drug effects: perturbed gene networks. Inferring post-perturbation gene networks on-demand for unseen drugs requires methods that generalize beyond traditional plug-in network estimators. 
We develop a scalable differentiable surrogate loss for multivariate Gaussians, which we apply to train a context-adaptive amortized estimator that maps perturbation metadata to gene-gene dependency network parameters. The resulting model, CellVS-Net, achieves SOTA on predicting how gene networks restructure under a variety of complex multivariate experimental conditions, including different cell types, small molecule therapeutics, signaling molecules, gene knockdowns, and gene over-expressions. When compared to other molecular and cell-level representations of drugs, we find that CellVS-Net achieves SOTA on both virtual screening benchmarks. Overall, CellVS-Net demonstrates that cell-level virtual screening methods are a viable alternative to molecular screening, and associated benchmarks enable hill-climbing on relevant drug discovery tasks.

13
Mantis-Delta: Mass-Action Network Theory and Steady-State Characterization for Chemical Reaction Networks

Venegas Hernandez, E. A.

2026-05-18 bioinformatics 10.64898/2026.05.14.725189 medRxiv
Top 0.5%
0.0%

Chemical Reaction Network Theory (CRNT), developed by Horn, Jackson, and Feinberg, provides parameter-free structural theorems that constrain the asymptotic dynamics of mass-action systems irrespective of the numerical values of the rate constants. Despite the maturity of the theory, modern open-source implementations that combine CRNT structural analysis with symbolic ordinary differential equation (ODE) construction and robust numerical steady-state finding remain scarce. We present mantis-delta, a pure Python library that ingests human-readable reaction strings, builds the complex reaction graph, computes the deficiency δ = n - ℓ - s and weak reversibility, and decides applicability of the Deficiency Zero Theorem (DZT) and Deficiency One Theorem (D1T). For systems satisfying these structural conditions, mantis-delta certifies, without any simulation whatsoever, existence, uniqueness and (for DZT) asymptotic stability of the positive steady state in every stoichiometric compatibility class. When the structural theorems do not apply, the library provides symbolic mass-action ODEs and Jacobians via SymPy and a hybrid numerical solver that combines stiff implicit integration with bound-constrained algebraic least-squares to locate both stable and unstable fixed points, including Hopf bifurcation centres inaccessible to forward integration. We demonstrate the workflow on six benchmarks: a reversible isomerisation, the Michaelis-Menten enzyme mechanism, the closed and chemostatted Brusselator, a catalytic hairpin assembly (CHA) miR-21 biosensor, and the Goldbeter-Koshland zero-order ultrasensitivity switch. In each case, the CRNT-predicted qualitative behaviour (monostability, oscillation, uniqueness) is recovered numerically with a residual below 10⁻⁶ M s⁻¹, and the Goldbeter-Koshland dose-response curve agrees with the closed-form quasi-steady-state approximation to within 1% over a 400x kinase/phosphatase activity scan. 
mantis-delta is open-source (MIT license) and available at https://github.com/emiliovenegas/mantis-delta.
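
The deficiency formula δ = n - ℓ - s (n complexes, ℓ linkage classes, s the rank of the stoichiometric matrix) can be worked through in a few lines. The sketch below is independent of mantis-delta and does not reflect its API: the helper names, the input convention (one `->` per string, reversible reactions written as two strings, coefficients separated from species by a space), and the examples are assumptions.

```python
import numpy as np

def parse_complex(text):
    """Turn 'E + S' or '2 X + Y' into a sorted tuple of (species, coefficient)."""
    counts = {}
    for term in text.split("+"):
        parts = term.split()
        if len(parts) == 2:
            coeff, name = int(parts[0]), parts[1]
        else:
            coeff, name = 1, parts[0]
        counts[name] = counts.get(name, 0) + coeff
    return tuple(sorted(counts.items()))

def deficiency(reactions):
    """Compute delta = n - l - s for reactions written as 'A + B -> C'."""
    complexes, edges = {}, []
    for rxn in reactions:
        lhs, rhs = rxn.split("->")
        pair = []
        for side in (lhs, rhs):
            c = parse_complex(side)
            complexes.setdefault(c, len(complexes))
            pair.append(complexes[c])
        edges.append(tuple(pair))
    n = len(complexes)

    # Linkage classes: connected components of the undirected complex graph.
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for a, b in edges:
        parent[find(a)] = find(b)
    n_linkage = len({find(i) for i in range(n)})

    # s: rank of the stoichiometric matrix (one column per reaction).
    species = sorted({sp for c in complexes for sp, _ in c})
    index = {sp: i for i, sp in enumerate(species)}
    vectors = np.zeros((len(species), n))
    for c, j in complexes.items():
        for sp, coeff in c:
            vectors[index[sp], j] = coeff
    stoich = np.array([vectors[:, b] - vectors[:, a] for a, b in edges]).T
    s = int(np.linalg.matrix_rank(stoich))
    return n - n_linkage - s

michaelis_menten = ["E + S -> ES", "ES -> E + S", "ES -> E + P"]
print(deficiency(michaelis_menten))  # n = 3, l = 1, s = 2, so delta = 0
```

For the Michaelis-Menten mechanism the complexes are E + S, ES, and E + P in a single linkage class, and the two independent reaction vectors give s = 2, hence δ = 0.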

14
Spurious correlation inflates performance in single-cell perturbation prediction

Nicol, P. B.; Shivakumar, S.; Irizarry, R.

2026-05-12 bioinformatics 10.64898/2026.05.07.723486 medRxiv
Top 0.5%
0.0%

The increasing number of computational methods designed to predict the effects of genetic perturbations on cellular gene expression profiles has led to a need for rigorous evaluation metrics. Recent benchmarking studies rely on correlation or cosine similarity of differential expression relative to a shared population of control cells. We show that these metrics are systematically inflated by statistical bias induced by reusing the same control population to define both quantities being compared. As a result, even non-informative methods can appear to perform well, particularly in datasets with limited numbers of control cells. Reanalysis of published datasets using a simple control-splitting procedure that removes this bias leads to a substantial reduction in performance previously attributed to biological signal.
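
The bias described above can be reproduced in a toy null simulation. The sketch below is not the authors' code or exact procedure: the Gaussian noise model, the cell and gene counts, and the particular way of splitting controls into two disjoint halves are assumptions chosen to make the mechanism visible.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])

rng = np.random.default_rng(0)
n_genes, n_ctrl, n_pert = 2000, 20, 200

# Null setting: the perturbation has no effect, so every cell is pure noise.
control = rng.normal(size=(n_ctrl, n_genes))
observed = rng.normal(size=(n_pert, n_genes)).mean(axis=0)
# A non-informative "prediction": the mean of unrelated cells.
predicted = rng.normal(size=(n_pert, n_genes)).mean(axis=0)

# Shared control: both differential-expression vectors subtract the same mean,
# so the control sampling noise appears in both and correlates them.
ctrl_mean = control.mean(axis=0)
shared = pearson(observed - ctrl_mean, predicted - ctrl_mean)

# Split control: disjoint halves of the control cells define the two baselines.
half = n_ctrl // 2
split = pearson(observed - control[:half].mean(axis=0),
                predicted - control[half:].mean(axis=0))

print(f"correlation with shared control: {shared:.2f}")
print(f"correlation with split control:  {split:.2f}")
```

Under these assumptions the control mean has per-gene variance 1/20 = 0.05 while each perturbed mean has variance 1/200 = 0.005, so the expected shared-control correlation is about 0.05 / 0.055 ≈ 0.91 even though there is no signal; with disjoint control halves the two vectors share no noise and the expected correlation is zero.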

15
BAT: an integrated pipeline for gene tree construction, annotation, and functional inference

Sheppard, B. D.; Behnken, B.; Steinbrenner, A.

2026-05-12 bioinformatics 10.64898/2026.05.07.721474 medRxiv
Top 0.5%
0.0%

Gene family functional exploration often requires analyzing motifs, domains, and associated datasets (e.g. gene expression) in the phylogenetic context of a gene tree. As genomic resources become more abundant, local pipelines are needed to analyze gene families of interest with project-specific resources. Here we present BLAST-Align-Tree (BAT), a bioinformatic pipeline for automated gene family phylogeny construction and annotation to enable gene tree exploration. BAT combines a BLAST search of local genome databases with a robust and flexible gene tree construction pipeline that enables multiple modes of annotation. Output visualizations display experimental datasets, custom regex-specified amino acid motifs, and protein HMM domain annotations. For flexibility, BAT runs locally and is independent of pre-existing databases, allowing the easy incorporation of custom genomes and datasets. Three primary case studies described here demonstrate the utility of BAT for inferring the function of homologs and orthologs within characterized gene families. BAT is suitable for fine-scale phylogenomic analysis of gene families across the tree of life, and default genomes available on installation span model eukaryotes.

16
A Rarefaction Approach to Identify Local Introgression in a Three Population Tree

Smith, T. Q.; Szpiech, Z. A.

2026-05-16 evolutionary biology 10.64898/2026.05.13.724952 medRxiv
Top 0.5%
0.0%

Patterson's D statistic, also known as the ABBA-BABA statistic, is widely used to detect the presence of archaic genome-wide introgression between two non-sister taxa. Requiring only a single lineage from each of four taxa, where one taxon acts as an outgroup to determine the ancestral allele, Patterson's D measures the imbalance between the number of biallelic sites where the second and third taxa share the derived allele (ABBA sites) and the number where the first and third taxa share it (BABA sites). When there is no introgression, these counts are expected to be equal, and a discordance between counts suggests introgression from the third taxon into either the first or second. Patterson's D is limited to the detection of genome-wide introgression and exhibits a high false-positive rate when applied to smaller genomic segments. Here, we present a new method, D STatistic with Allelic Rarefaction (D*), to address these limitations. D* uses multiple lineages and does not require an outgroup to calculate the imbalance between the number of alleles found exclusively in the second and third taxa and the number of alleles found exclusively in the first and third taxa. D* employs a rarefaction technique to correct for unequal sample sizes and allows multiallelic sites. We use simulations to show that D* has better precision and recall for detecting introgressed segments of DNA when compared to similar methods under a wide variety of model parameters and in the presence of technical artifacts common to ancient DNA analyses. We conclude with an analysis of Denisovan DNA introgression in modern-day Papuans. Precompiled executables, the manual, and source code can be found at https://github.com/TQ-Smith/DSTAR
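
As background for the comparison, the classical single-lineage statistic is D = (nABBA - nBABA) / (nABBA + nBABA). The sketch below computes it from toy site patterns; it illustrates Patterson's D only, not the D* method or the DSTAR software, and the function name and input layout are assumptions.

```python
def patterson_d(sites):
    """Patterson's D from single-lineage alleles at biallelic sites.

    Each site is a 4-tuple (p1, p2, p3, outgroup) of allele labels. The
    outgroup allele is treated as ancestral (A); the other allele is derived (B).
    """
    abba = baba = 0
    for p1, p2, p3, out in sites:
        if len({p1, p2, p3, out}) != 2:
            continue  # skip invariant and multiallelic sites
        pattern = "".join("A" if allele == out else "B" for allele in (p1, p2, p3, out))
        if pattern == "ABBA":
            abba += 1
        elif pattern == "BABA":
            baba += 1
    if abba + baba == 0:
        raise ValueError("no informative ABBA or BABA sites")
    return (abba - baba) / (abba + baba)

toy_sites = (
    [("C", "T", "T", "C")] * 3    # ABBA: taxa 2 and 3 share the derived allele
    + [("T", "C", "T", "C")]      # BABA: taxa 1 and 3 share the derived allele
    + [("C", "C", "T", "C")]      # derived allele private to taxon 3: uninformative
)
print(patterson_d(toy_sites))  # (3 - 1) / (3 + 1) = 0.5
```

With only a handful of informative sites in a short window, the ratio is dominated by sampling noise, which is one way to see why the genome-wide statistic behaves poorly on small segments.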

17
Counterfactual Explanations for Graph Neural Networks in Patient Outcome Prediction

Chaidos, N.; Dimitriou, A.; Calzi, H.; Casiraghi, E.; Stamou, G.; Valentini, G.

2026-05-20 bioinformatics 10.64898/2026.05.18.725906 medRxiv
Top 0.5%
0.0%

Counterfactual Explanation (CE) algorithms have been successfully applied to uncover the main factors driving computational diagnostic and prognostic predictions on tabular medical data. Recently, a new Network Medicine paradigm has been introduced for patient diagnosis and prognosis using Patient Similarity Networks (PSNs), i.e. graphs where patients are represented as nodes and their clinical and biomolecular similarities as edges. In this context, graph-based algorithms, including Graph Neural Networks (GNNs), can provide predictions using not only individual patient features but also their relations within a network of clinically and biomolecularly similar individuals. In this work, we propose the first CE algorithm tailored to explain diagnostic and prognostic predictions within PSNs. Alongside a contrastive GNN backbone, we introduce a versatile, model-agnostic counterfactual search method compatible with any underlying classifier. Preliminary results on synthetic data and on a cohort of patients affected by Alzheimer's disease show that our algorithm is competitive both with seminal tabular-based CE algorithms and GNNExplainer, a well-established method for explaining graph-based classification tasks.

18
Culsma: A Formal Language for Laboratory Protocols

Chen, Y.; Sun, M.; Tadepally, L.; Wang, J.; Barcenilla, H.; Gonzalez, L.; Brodin, P.

2026-05-12 bioinformatics 10.64898/2026.05.07.723509 medRxiv
Top 0.5%
0.0%

The application of artificial intelligence to biomedical research increasingly depends on iterative cycles in which AI systems analyze experimental data, propose follow-up conditions, and drive automated execution at scale, a paradigm central to Bio-AI and autonomous laboratory science. For such cycles to operate, laboratory protocols must be expressed in a form that is simultaneously human-readable and machine-executable. Natural-language descriptions, the current standard in laboratory practice, do not satisfy this dual requirement. We present Culsma, a formal language and execution framework that elevates laboratory protocols from informal prose to semantically explicit workflow programs that can be analyzed, validated, executed, and transferred across settings. The same protocol can be read and verified by a bench scientist, and parsed, validated, and executed by an automated pipeline without re-translation. We demonstrate an end-to-end implementation providing concrete evidence of practical viability.

19
Cosine Similarity Conflates Clinically Distinct Cancer Variants: A Case for Typed-Graph Retrieval in Precision Oncology Decision Support

Khan, U. A.

2026-05-11 bioinformatics 10.64898/2026.05.05.723102 medRxiv
Top 0.5%
0.0%

Retrieval-augmented generation (RAG) is increasingly applied to clinical decision support in oncology, where treatment selection depends on identifying a patient's specific somatic variant from an NGS report and matching it to evidence-graded therapy options. The vector retrieval that underlies most RAG systems uses cosine similarity over text embeddings, an architecture optimized for linguistic proximity rather than entity-level identity. We hypothesize that cosine-similarity-based retrieval conflates clinically distinct cancer variants at clinically relevant rates, while a typed-graph approach in which each variant is a discrete node preserves variant-level identity by construction. We evaluated 9 cancer variant pairs known to have differential FDA-approved therapy indications, with variant identity informed by the CIViC clinical variant evidence database and primary clinical literature. Variant pairs included BRAF V600E vs V600K (melanoma), EGFR L858R vs T790M (NSCLC, the canonical sensitivity-vs-resistance pair), EGFR exon 19 deletion vs L858R, KRAS G12C vs G12D (only G12C has FDA-approved targeted therapy), KRAS G12C vs G12V, ERBB2 amplification vs activating mutation, two PIK3CA hotspot pairs, and NTRK1 fusion vs point mutation. We computed pairwise cosine similarity for each variant's text representation across three open-source embedding models (PubMedBERT, MedCPT, BGE-large-en-v1.5) and three text formats (short, medium, long). Across the medium format (gene + variant + tumor type), 100% of clinically distinct variant pairs (9/9) had cosine similarity ≥ 0.95 under both biomedical encoders (PubMedBERT, MedCPT). The general-purpose encoder (BGE-large-en-v1.5) showed lower conflation in the medium format (11%) but rose to 100% with added clinical context. At the more stringent τ = 0.99 (averaged across formats), PubMedBERT conflated 56% of pairs and MedCPT conflated 22%. 
The biomedically pre-trained encoders performed worse, not better, than the general-purpose encoder. The typed-graph baseline achieves zero conflation by construction. We discuss the architectural implications: vector retrieval is appropriate for unstructured literature search but introduces unsafe ambiguity when used as the substrate for variant-level reasoning that drives drug-selection decisions. We argue that typed-graph retrieval should be the default architecture for any retrieval-grounded clinical decision support system that recommends targeted therapy.

20
A lightweight codon-based DNA Transformer for Regulatory Region Identification in the Genome

Karthik, A. S. P.; Das, A. B.

2026-05-07 bioinformatics 10.64898/2026.05.04.722647 medRxiv
Top 0.5%
0.0%

We developed a lightweight codon-based DNA Transformer equipped with multi-head self-attention and an adaptive classifier head, which achieves exon-intron classification with high accuracy and moderate accuracy in CDS classification and splice site recognition. We named this model ExIT (Exon-Intron Transformer). The model uses codon tokenization. It has been validated on the human genome, with external validation on the chimpanzee genome. Further benchmarking suggests that our model outperforms existing models on these tasks.
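
Codon tokenization is commonly understood as splitting a sequence into non-overlapping 3-mers and mapping each to an integer id. The sketch below shows that idea only; the vocabulary layout, the unknown-token id, the `frame` parameter, and the dropping of trailing bases are assumptions and may differ from what ExIT actually does.

```python
from itertools import product

# Vocabulary: 64 codons plus an unknown token (assumption, not ExIT's vocabulary).
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]
VOCAB = {codon: i + 1 for i, codon in enumerate(CODONS)}
UNK_ID = 0

def codon_tokenize(seq, frame=0):
    """Split seq into non-overlapping 3-mers starting at `frame` and map to ids.

    Codons containing characters other than A, C, G, T map to UNK_ID.
    Trailing bases that do not fill a complete codon are dropped.
    """
    seq = seq.upper()
    tokens = []
    for i in range(frame, len(seq) - 2, 3):
        tokens.append(VOCAB.get(seq[i:i + 3], UNK_ID))
    return tokens

print(codon_tokenize("ATGAAATAG"))   # three codons: ATG, AAA, TAG
print(codon_tokenize("atgnnnTAGc"))  # lower case accepted, NNN is unknown, 'c' dropped
```

Compared with single-nucleotide tokens, this shortens the input roughly threefold, which is one plausible reason a codon-level model can stay lightweight.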